feat: support option for continuous monitoring of token usage in streaming response #111
base: main
Conversation
// MonitorContinuousUsageStats controls whether the external processor monitors every response-body chunk for usage stats.
// When true, it looks for token usage metadata in every response-body chunk received during a streaming request
// (compatible with vLLM's 'continuous_usage_stats' flag).
// When false, it stops monitoring once token usage metadata has been found for the first time
// (compatible with OpenAI's streaming responses, https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage).
// Only affects requests in streaming mode.
MonitorContinuousUsageStats bool `yaml:"monitorContinuousUsageStats,omitempty"`
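For illustration, a minimal sketch of how a streaming processor might honour this flag; the types and names below are illustrative, not the actual ai-gateway extproc code:

package sketch

import "encoding/json"

// usage mirrors the OpenAI-style usage object carried in streamed chunks.
type usage struct {
	PromptTokens     uint32 `json:"prompt_tokens"`
	CompletionTokens uint32 `json:"completion_tokens"`
	TotalTokens      uint32 `json:"total_tokens"`
}

// chunk is the subset of a streamed chat-completion chunk we care about.
type chunk struct {
	Usage *usage `json:"usage"`
}

// streamMonitor tracks token usage across response-body chunks.
type streamMonitor struct {
	monitorContinuous bool // value of MonitorContinuousUsageStats
	found             bool
	latest            usage
}

// onBodyChunk inspects one response-body chunk for token usage metadata.
func (m *streamMonitor) onBodyChunk(data []byte) {
	if m.found && !m.monitorContinuous {
		// OpenAI-compatible streams report usage once (in the final chunk),
		// so stop scanning after the first hit.
		return
	}
	var c chunk
	if err := json.Unmarshal(data, &c); err != nil || c.Usage == nil {
		return
	}
	// vLLM's continuous_usage_stats sends running totals, so the latest
	// value simply replaces the previous one.
	m.latest, m.found = *c.Usage, true
}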
could you remove the change related to this? I think this is a separate issue, and the metadata is not cumulative, so it basically overrides previous values if emitted in the middle of the stream.
I think this should be a property on the AIServiceBackend, as only certain backends support this, e.g. the vLLM service backend.
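A hypothetical sketch of what that could look like; the field name and its placement are illustrative only and are not part of this PR or the real API:

package sketch

// AIServiceBackendSpecSketch is an illustration of moving the toggle onto the backend API.
type AIServiceBackendSpecSketch struct {
	// MonitorContinuousUsageStats would enable per-chunk token-usage
	// monitoring only for backends that emit usage in every streamed chunk
	// (e.g. vLLM with continuous_usage_stats enabled).
	MonitorContinuousUsageStats *bool `json:"monitorContinuousUsageStats,omitempty"`
}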
sorry, this conflicts with #103; I would appreciate it if you could hold off until it lands. I think that PR will supersede this one apart from the vLLM-specific part.
@sukumargaonkar thank you for waiting - #103 has landed so could you rework the PR and focus on the vllm stuff?
ping
@sukumargaonkar do you still want to continue the PR here? I will close this in a few days if there's no response, as there's no reason to keep it open
yes, will rebase and include only the vLLM-specific changes
great!
checking - how long do you need to rework the PR here? @sukumargaonkar it should be pretty straightforward, right? I wonder if this can get into the initial release.
Currently only total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well. Signed-off-by: Sukumar Gaonkar <[email protected]>
a couple of questions:
- do we really want to make this an option? Even if so, adding the option only to filterconfig.go doesn't make sense.
- how does this apply to AWS?
I made a comment above: it should be a flag for the AIServiceBackend.
@yuzisun I think I should've rephrased the question: why don't we just enable this logic by default (i.e. remove the .bufferingDone flag)? In any case, we parse every event by default (since the used-token event is always the last one), so this option doesn't help us at all in terms of the computation. (On that note, I think I shouldn't have introduced bufferingDone in the first place.)
I agree, it's better to parse every message chunk while streaming to check for the presence of usage data. I don't think we need to do the aggregation here: ai-gateway/internal/extproc/processor.go, lines 203 to 205 in 2c62874.
thoughts @mathetake ?
I see, that's a good finding. Is there any documentation about that? Then let's change the aggregation as well as remove the bufferingDone flag.
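If the docs confirm it, the change could be as small as this sketch; the type and both helpers are illustrative, not the actual processor.go code:

package sketch

type tokenUsage struct{ inputTokens, outputTokens, totalTokens uint32 }

// accumulate is the behaviour being questioned: summing every usage-bearing chunk.
func accumulate(agg *tokenUsage, u tokenUsage) {
	agg.inputTokens += u.inputTokens
	agg.outputTokens += u.outputTokens
	agg.totalTokens += u.totalTokens
}

// replaceLatest is the proposed behaviour: streamed usage values are running
// totals (OpenAI's final-chunk usage and vLLM's continuous_usage_stats alike),
// so the most recent observation wins and no aggregation is needed.
func replaceLatest(agg *tokenUsage, u tokenUsage) {
	*agg = u
}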
ping
@sukumargaonkar hey, are you still interested in this? I would prefer the fast turnaround of a single PR and to avoid keeping the context around for a long time, so I would appreciate it if you could rework the PR soon. Otherwise, I will close it and redo it myself. Thanks!
Currently only total_tokens usage from the response body is pushed to dynamicMetadata. This PR updates that logic to include input and output token usage as well.
This PR also introduces a monitorContinuousUsageStats flag in the config for the external process. The flag controls whether the external process monitors every response-body chunk for usage stats:
- when true, it monitors for token usage metadata in every response-body chunk received during a streaming request (compatible with vLLM's 'continuous_usage_stats' flag)
- when false, it stops monitoring once token usage metadata has been found for the first time (compatible with OpenAI's streaming responses, https://platform.openai.com/docs/api-reference/chat/streaming#chat/streaming-usage)
The flag only affects requests in streaming mode.
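For reference, a sketch of the two client-side streaming modes this flag distinguishes; continuous_usage_stats is vLLM's extension to stream_options, and the endpoint and model below are placeholders:

package sketch

import (
	"bytes"
	"encoding/json"
	"net/http"
)

// streamRequest builds a chat-completion request with usage reporting enabled.
// With continuous=true it asks a vLLM backend for usage stats on every chunk;
// with continuous=false it matches plain OpenAI behaviour, where usage arrives
// once in the final chunk.
func streamRequest(continuous bool) (*http.Request, error) {
	streamOptions := map[string]any{"include_usage": true}
	if continuous {
		streamOptions["continuous_usage_stats"] = true // vLLM-specific extension
	}
	body, err := json.Marshal(map[string]any{
		"model":          "placeholder-model",
		"stream":         true,
		"stream_options": streamOptions,
		"messages": []map[string]string{
			{"role": "user", "content": "hello"},
		},
	})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, "http://localhost:8000/v1/chat/completions", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	return req, nil
}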